On Measuring Localization of Shortcuts in Deep Networks

Tsoy, Nikita, Konstantinov, Nikola

arXiv.org Machine Learning

Shortcuts, spurious rules that perform well during training but fail to generalize, present a major challenge to the reliability of deep networks (Geirhos et al., 2020). However, the impact of shortcuts on feature representations remains understudied, obstructing the design of principled shortcut-mitigation methods. To overcome this limitation, we investigate the layer-wise localization of shortcuts in deep models. Our novel experiment design quantifies each layer's contribution to the accuracy degradation caused by a shortcut-inducing skew via counterfactual training on clean and skewed datasets. We employ our design to study shortcuts on the CIFAR-10, Waterbirds, and CelebA datasets across VGG, ResNet, DeiT, and ConvNeXt architectures. We find that shortcut learning is not localized in specific layers but distributed throughout the network. Different network parts play different roles in this process: shallow layers predominantly encode spurious features, while deeper layers predominantly forget core features that are predictive on clean data. We also analyze how localization differs across settings and describe its principal axes of variation. Finally, our analysis of layer-wise shortcut-mitigation strategies suggests that designing general methods is hard, supporting dataset- and architecture-specific approaches instead.
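To make the counterfactual design described above concrete, below is a minimal sketch of one way such a layer-wise probe could be implemented, assuming it amounts to swapping individual layers' weights between a clean-trained and a skew-trained copy of the same network and measuring the resulting change in clean-test accuracy. The function name, the eval_fn helper, and the exact swapping protocol are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch (not the paper's code): attribute shortcut-induced
    # accuracy degradation to individual layers by swapping per-layer weights
    # from a skew-trained model into a clean-trained copy of the same network.
    import copy

    def layerwise_shortcut_probe(model_clean, model_skewed, eval_fn):
        # eval_fn(model) -> accuracy on a clean held-out set (assumed helper)
        base_acc = eval_fn(model_clean)
        skewed_state = model_skewed.state_dict()
        degradation = {}
        for name, _ in model_clean.named_parameters():
            probe = copy.deepcopy(model_clean)
            state = probe.state_dict()
            state[name] = skewed_state[name].clone()  # counterfactual swap of one tensor
            probe.load_state_dict(state)
            degradation[name] = base_acc - eval_fn(probe)  # drop attributed to this layer
        return degradation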



Neural Information Processing Systems

We would like to thank the reviewers for their valuable and constructive feedback. AdaReg does not explicitly enforce the weight matrices to be positively or negatively correlated; therefore, our method is orthogonal to, and not in conflict with, Dropout. Inspired by this result, we explored hyperparameter learning via empirical Bayes. Regarding BatchNorm, we do observe that a smaller batch size leads to better generalization.





On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Malladi, Sadhika, Lyu, Kaifeng, Panigrahi, Abhishek, Arora, Sanjeev

arXiv.org Artificial Intelligence

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing the batch size, and its empirical validation in deep learning settings.
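As a concrete illustration of how such a rule might be applied when scaling up the batch size, the sketch below rescales Adam-style hyperparameters under the common statement of the square root rule: multiplying the batch size by kappa multiplies the learning rate by sqrt(kappa). The accompanying adjustments to beta1, beta2, and eps follow one reading of the paper and should be checked against it before use.

    import math

    def square_root_scaling(lr, beta1, beta2, eps, kappa):
        # Sketch of the square root scaling rule when the batch size is
        # multiplied by kappa; the beta/eps factors are an assumption to verify.
        return {
            "lr": lr * math.sqrt(kappa),            # step size grows with sqrt(kappa)
            "beta1": 1.0 - kappa * (1.0 - beta1),   # shorten the first-moment horizon
            "beta2": 1.0 - kappa * (1.0 - beta2),   # shorten the second-moment horizon
            "eps": eps / math.sqrt(kappa),
        }

    # Example: scaling the batch size from 512 to 2048 (kappa = 4)
    print(square_root_scaling(lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8, kappa=4))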


Lessons for Improving Training Performance -- Part 1

#artificialintelligence

Nine months ago, as part of a joint reference architecture launch with Nvidia, Pure Storage published TensorFlow deep learning performance results. The goal of creating a joint architecture with Nvidia was to identify and solve performance bottlenecks present in an end-to-end deep learning environment -- especially at scale. During creation of our reference architecture, my team identified and resolved performance issues across storage, networking, and compute. Our system is a physical entity, and everything from cabling configuration and MTU size to the TensorFlow prefetch buffer size can impact performance. (Figure: the software and hardware stack in our test environment.)
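Since the prefetch buffer is called out as one of the software knobs that affected throughput, here is a minimal tf.data sketch of where that setting lives; the file pattern, shuffle buffer, and batch size are placeholders rather than the values from the reference architecture.

    import tensorflow as tf

    # Placeholder input pipeline; paths and sizes are illustrative only.
    files = tf.data.Dataset.list_files("/datasets/train-*.tfrecord")
    dataset = (
        tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
        .shuffle(10_000)
        .batch(256)
        # Overlap input preprocessing with accelerator compute; AUTOTUNE lets
        # the runtime size the prefetch buffer instead of hand-tuning it.
        .prefetch(tf.data.AUTOTUNE)
    )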